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Abstract 

We describe a technique for drawing values from discrete 
distributions, such as sampling from the random variables of 
a mixture model, that avoids computing a complete table of 
partial sums of the relative probabilities. A table of alternate 
(“butterfly-patterned”) form is faster to compute, making 
better use of coalesced memory accesses. From this table, 
complete partial sums are computed on the fly during a 
binary search. Measurements using an NVIDIA Titan Black 
GPU show that for a sufficiently large number of clusters or 
topics {K > 200), this technique alone more than doubles 
the speed of a latent Dirichlet allocation (LDA) application 
already highly tuned for GPU execution. 

Keywords butterfly, coalesced memory access, discrete 
distribution, GPU, graphics processing unit, latent Dirichlet 
allocation, LDA, machine learning, multithreading, memory 
bottleneck, parallel computing, random sampling, SIMD, 
transposed memory access 

1. Overview 

The successful use of Graphics Processing Units (GPUs) 
to train neural networks is a great example of how ma¬ 
chine learning can benefit from such massively parallel ar¬ 
chitecture. Generative probabilistic modeling HI and asso¬ 
ciated inference methods (such as Monte Carlo methods) 
can also benefit. Indeed, authors such as Suchard et al. El 
and Lee et al. m have pointed out that many algorithms of 
interest are embarrassingly parallel. However, the potential 
for massively parallel computation is only the first step to¬ 
ward full use of GPU capacity. One bottleneck that such em¬ 


barrassingly parallel algorithms run into is related to mem¬ 
ory bandwidth; one must design key probabilistic primitives 
with such constraints in mind. 

We address the important case wherein parallel threads 
must draw from distinct discrete distributions in a SIMD 
fashion. This can arise when implementing any mixture 
model, and Latent Dirichlet Allocation (LDA) models in 
particular, which are probabilistic mixture models used to 
discover abstract “topics” in a collection of documents (a 
corpus) G). This model can be fitted (or “trained”) in an 
unsupervised fashion using sampling methods lUl chapter 
iiiii. Each document is modeled as a distribution 9 over 
topics, and each word in a document is assumed to be drawn 
from a distribution (p of words. Understanding the methods 
described in this paper does not require a deep understanding 
of sampling algorithms for LDA. What is important is that 
each word in a corpus is associated with a so-called “latent” 
random variable |[T] chapter 9], usually referred to as z, that 
takes on one of K integer values, indicating a topic to which 
the word belongs. Broadly speaking, the iterative training 
process works by tentatively choosing a topic (that is, sam¬ 
pling the random latent variable z) for a given word using 
relative probabilities calculated from 9 and (p, then updating 
9 and (p accordingly. 

In this paper, we focus on the step that draws new z values 
from a finite domain, given arrays of (typically floating¬ 
point) parameters 9 and (p, typically using the following 
four-step process for each new z value to be drawn; 

1. Use the parameters to construct a table of relative (un¬ 
normalized) probabilities for the possible choices for z, 
where each relative probability is a product of some ele¬ 
ment of 9 and some element of p. 

2 . Normalize these table entries by dividing each entry by 
the sum of all entries. 

3. Let u be chosen uniformly at random from the real in¬ 
terval [0,1). 

4. Find the smallest index j such that the sum of all table 
entries at or below index j is larger than u. 
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In practice, this sequence of steps may be optimized by 
doing a bit of algebra and using a binary search: 

1. Use the parameters to construct a table of relative prob¬ 
abilities for the possible choices for the z value. 

2. Replace this table with its sum-prebx: each entry is re¬ 
placed by the sum of itself and all earlier entries. The last 
entry therefore becomes the sum of all the original entries. 

3. Let u be chosen uniformly at random from the real in¬ 
terval [0,1). 

4. Use a binary search to find the smallest index j such that 
the entry at index j is larger than u times the last entry. 

When this algorithm is implemented in parallel on a SIMD 
GPU, an obvious approach is to assign the computation of 
each z value to a separate thread. However, when the threads 
fetch their respective (p values, the values to be fetched 
will likely reside at unrelated locations in memory, resulting 
in poor memory-fetch performance. A standard trick is to 
have all the threads in a warp cooperate with each other 
(compare, for example, the storage of floating-point numbers 
as “slicewise” rather than “fleldwise” in the architecture 
of the Connection Machine Model CM-2, so that 32 1-bit 
processors cooperate on each of 32 clock cycles to fetch 
and store an entire 32-bit floating-point value that logically 
belongs to just one of the 32 processors ||6l). Suppose there 
are W threads in a warp (a typical value is lU = 32), 
each working on a different document (and therefore using 
different discrete probability distributions), and suppose that 
at a certain step of the algorithm each thread needs to fetch 
W different p values. The array of (p values can be organized 
so that the W different (p values needed by any one thread 
are stored in consecutive memory locations. The trick is that 
each thread performs W fetches from the <p table, as before, 
but instead of each thread fetching its own p values in all 
cases, on cycle k all the threads work together to fetch the 
W different p values needed by thread k. As a result, on 
each memory cycle all W values being fetched are in a 
contiguous region of memory, allowing improved memory- 
fetch performance. 

It is then necessary for the threads to exchange informa¬ 
tion among themselves so that the rest of the algorithm may 
be carried out, including the summation arithmetic. 

The binary search does not access all entries in the prefix- 
sum table (in fact, for a table of size N it examines only 
about log 2 N entries). Therefore it is not necessary to com¬ 
pute all entries of the prefix-sum table. We present an alter¬ 
nate technique that computes a “butterfly-patterned” partial- 
sums table, using less computational and communication ef¬ 
fort; a modified binary search then uses this table to com¬ 
pute, on the fly, entries that would have been in the original 
complete prefix-sum table. This requires more work per ta¬ 
ble entry during the binary search, but because the search ex¬ 
amines only a few table entries, the result is a dramatic net 
reduction in execution time. This technique may be effec¬ 
tive for collapsed LDA Gibbs samplers uni ns ED as well 


as uncollapsed samplers, and may also be useful for GPU 
implementations of other algorithms ll^ whose inner loops 
sample from discrete distributions. 

2. Basic Algorithm 

For a Latent Dirichlet Allocation model of, for example, a 
set of documents to which we want to assign topics probab- 
listically using Gibbs sampling, let M be the number of doc¬ 
uments, K be the number of topics, and V be the size of the 
vocabulary, which is a set of distinct words. Each document 
is a bag of words, each of which belongs to the vocabulary; 
any given word can appear in any number of documents, and 
may appear any number of times in any single document. 
The documents may be of different lengths. 

We are interested in the phase of an uncollapsed Gibbs 
sampler that draws new z values, given 9 and p distributions. 
Because no z value directly depends on any other z value in 
this formulation, new z values may all be computed inde¬ 
pendently (and therefore in parallel to any extent desired). 

We assume that we are given an M x K matrix 9 and a 
V X K matrix p-, the elements of these matrices are non¬ 
negative numbers, typically represented as floating-point 
values. Row m of 9 (that is, 9[m, •]) is the (currently as¬ 
sumed) distribution of topics for document m, that is, the 
relative probabilities (weights) for each of the K possible 
topics to which the document might be assigned. Note that 
columns of 9 are not to be considered as distributions. Simi¬ 
larly, column kofp (that is, p[-,k) is the (currently assumed) 
distribution of words for topic k, that is, the weights with 
which the V possible words in the vocabulary are associated 
with the topic. Note that rows of p are not to be considered 
as distributions. We organize 9 as rows and p as columns 
for engineering reasons; we want the K entries obtained by 
ranging over all possible topics to be contiguous in memory 
so as to take advantage of memory cache structure. 

We also assume that we are given (i) a length-M vector 
of nonnegative integers N such that N[m] is the length of 
document m, and (ii) an MxN ragged array w, by which we 
mean that for 0 < m < M, w[m] is a vector of length N[m]. 
(We use zero-based indexing throughout this document.) 
Each element of w is less than V and may therefore be used 
as a first index for p. 

Our goal, given K, M, V, N, p, 9, and w and assuming 
the use of a temporary M x N x K ragged work array 
a (which we will later optimize away), is to compute all 
the elements for an M x N ragged array z as follows: 
Eor all m such that 0 < m < M and for all i such that 
0 < i < N[m], do two things: first, for all k such that 
0 < fc < AT, let = 9[m,k]xp[w[m,i],k]', second, 

let z[m,i] be a nonnegative integer less than K, chosen 
randomly in such a way that the probability of choosing 
the value k' is o[m][i][fc']/cr where a = ^ 

0<k<K 

Thus, is a relative (unnormalized) probability, 

and a[m][i][fc']/cr is an absolute (normalized) probability. 


Algorithm 1 Drawing new z values 
1: procedure D R AWZ {N[Mie[M,K\,(j)[V,K], 

2: r(;[M][7V]; output z[M, A”]) 

3: local array a [M] [ A] [AT], p [M] [ A] [A] 

4: for all 0 < TO < M do 

5: for all 0 < i < A[to] do 

6: > Compute O-cj) products 

7: for all 0 < fc < A do 

8: a[TO][i][fc] ^ 9[nn,k] x (/)[w[TO][i], A:] 

9: end for 

10: > Compute partials sums of the products 

11: let sum -It- 0.0 

12: for fc from 0 through A — 1 do 

13: sumsum + a[m][i][k] 

14: p[m][i][k] <— sum 

15: end for 

16: let j ^ 0 

17: (search the table of partial sums) 

18: z[m,i]^j 

19: end for 

20: end for 
21: end 


Algorithm [T] is a basic implementation of this process. 
We remark that a “let” statement creates a local binding of 
a scalar (single-valued) variable and gives it a value, that a 
“local array” declaration creates a local binding of an array 
variable (containing an element value for each indexable 
position in the array), and that distinct iterations of a “for” 
or “for all” construct are understood to create distinct and 
independent instantiations of such local variables for each 
iteration. The iterations of “for ... from ... through ...” are 
executed sequentially in a specific order; but the iterations 
of a “for all” construct are intended to be computationally 
independent and therefore may be executed in any order, or 
in parallel, or in any sequential-parallel combination. We use 
angle brackets to indicate the use of a “code chunk” that is 
defined as a separate algorithm; such a use indicates that 
the definition of the code chunk should be inserted at the 
use site, as if it were a C macro, but surrounded by begin 
and end (this is a programming-language technicality that 
ensures that the scope of any variable declared within the 
code chunk is confined to that code chunk). 

The computation of the 6-(j) products (lines |3^2lof Algo¬ 
rithm n is straightforward. The computation of partial sums 
(lines flTHTSl l is sequential; the variable sum accumulates the 
products, and successive values of sum are stored into the 
array p. A random integer is chosen for z[m, i] by choosing 
a random value uniformly from the range [0, 0,1.0), scaling 
it by the final value of sum (which has the same algorith¬ 
mic effect as dividing each p[TO][i][fc] by that value, for all 
0 < A: < A, to turn it into an absolute probability), and then 
searching the subarray p[m] [i] to find the smallest entry that 
is larger than the scaled value (and if there are several such 


Algorithm 2 Simple linear search 
1: code chunk (search the table of partial sums): 
2 : let u random value chosen from [0.0,1.0) 

3: let stop -^r- sum x u 

4: while j < K — 1 and stop > p[TO][i][j] do 
5: j ^ j + 1 

6 : end while 
7: end 


Algorithm 3 Simple binary search 


code chunk (search the table of partial sums): 
let u random value chosen from [0.0,1.0) 
let stop -^r- sum x u 
let A: ^ A - 1 

while j < A; do 

■ ^ , j + k 

let mid ^ ^— 

if stop < p[TO[[i][TOi(i] then 
k •<— mid 

else 

j •<— mid + 1 

end if 
end while 
end 


entries, all equal, then the one with the smallest index is cho¬ 
sen); the index j of that entry is used as the desired randomly 
chosen integer. A simple linear search (Algorithm]^ can do 
the job. But because all elements of 9 and (p are nonnegative, 
the products in a are also nonnegative, and so each subar¬ 
ray p[TO][i[ is monotonically nondecreasing; that is, for all 
0 < TO < M, 0 < i < A[to], and 0 < A; < A, we have 
p[to][i][A: — 1] < p[TO][i][A:]. Therefore a binary search (Al¬ 
gorithmic can be used instead, which is faster, on average, 
for A sufficiently large ||8] exercise 6.2.1-5]. 

3. Blocking and Transposition 

Anticipating certain characteristics of the hardware, we 
make some commitments as to how the algorithm will be 
executed. We assume that arrays are laid out in row-major 
order (as they are when using C or CUDA). Let IL be a 
machine-dependent constant (typically 16 or 32, but for now 
we do not require that kk be a power of 2). For purposes of 
illustration we assume Ik = 8 and also A = 19. We divide 
the documents into groups of size W and assume that M is 
an exact multiple of W. (In an overall application, the set 
of documents can be padded with empty documents so as 
to make M be an exact multiple of W without affecting the 
overall behavior of the algorithm on the “real” documents.) 
We split the outermost loop of Algorithm[T](with index vari¬ 
able to) into two nested loops with index variables q and r, 
from which the equivalent value for to is then computed. We 
commit to making the loop with index variable i sequential. 




















Algorithm 4 Drawing 2 ; values (transposed access) 

1 : procedure DRAwZ(iV[M], 0[M, AT], X], 

2: r(;[M][7V]; output 2 ;[M, A”]) 

3 : local array p[M,K],a [M, W ] 

4: for all 0 < g < M/W do 
5: local array Cwarp [W,W] 

6 : for SIMD 0 < r < ly do 

7: let TO ^ g X VF + r 

8: local array 6»iocai[^] 

9: (cache 9 values into 6*iocai) 

10 : let ^master ^ 0 

11: while a?7t/(Zmaster < A[to]) dO 

12: let i ^ TOm(iinaster, N[m] - 1) 

13: (compute partial sums of 9-(j) products) 

14: let j 0 

15: (search the table of partial sums) 

16: z[m,i\-^j 

^master ^ ^master “f 1 

18 : end while 

19: end for 

20 : end for 
21 : end 


to treating the iterations of the loop on q as independent (and 
therefore possibly parallel), and to treating the iterations of 
the loop on r as executed by a SIMD “thread warp” of size 
W, that is, parallel and implicitly lock-step synchronized. 
As a result, we view each of the M documents as being pro¬ 
cessed by a separate thread. A beneht of making the loop on 
i sequential is that the array p can be made two-dimensional 
and non-ragged, having size M x K. We fuse the loop that 
computes O-cj) products with the loop that computes partial 
sums; this eliminates the need for the array a, but instead (for 
reasons explained below) we retain a as a two-dimensional, 
non-ragged array of size M x W that is used only when 
K > W. Within the loop controlling index variable q we 
declare a local work array c^arp of size W x W that will 
be used to exchange information by the W threads within a 
warp; our eventual intent is that this array will reside in GPU 
registers. We cache values from the array 6* in a per-thread 
array 0iocai of length K, anticipating that such cached values 
will reside in a faster memory and be used repeatedly by the 
loop on i. 

There is, however, a subtle problem with the loop con¬ 
trolling index variable i: the upper bound N[m] for the loop 
variable may be different for different threads. As a result, in 
the last iterations it may be that some threads have “gone to 
sleep” because they reached their upper loop bound earlier 
than other threads in the warp. This is undesirable because, 
as we shall see, we rely on all threads “staying awake” so 
that they can assist each other. Therefore, we rewrite the 
loop control to use a “master index” idiom and exploit the 
trick of allowing a thread to perform its last iteration (with 


Algorithm 5 Caching 6 values (transposed access) 
1: code chunk (cache 6 values into 0iocai): 

2 : let j 0 

3: while j < {K mod W) do 

4: 6»iocal[j] ^ 6 »[to, j] 

5 : i^j + l 

6 : end while 
7: while j < A do 

8: for k from 0 through kF — 1 do 

9: > Next line uses transposed access to 9 

10 : 9ioca.i[j+ k]^9[qxW+ k,j+ r] 

11: end for 

12 : j ^ j + W 

13: end while 

14: end 


i = N[m] — 1) multiple times, which doesn’t work for many 
algorithms but is acceptable for LDA Gibbs. 

The result of all these code transformations is Algo¬ 
rithm ffl which makes use of three code chunks; Algo- 
rithmsj^l^ and either Algorithm[^or[^ Algorithms]^ and 
besides using SIMD thread warps of size W to process doc¬ 
uments in groups of size W, also process topics in blocks 
of size W. This allows the innermost loops to process “lit¬ 
tle” arrays of size W x W. If K (the number of topics) is 
not a multiple of W, then there will be a remnant of size 
K mod W. To make looping code slightly simpler, we put 
the remnant at the front of each array, rather than at the end. 
For VF = 8 and K = 19, topics 0, 1, and 2 form the remnant; 
topics 3-10 form a block of length 8; and topics 11-18 form 
a second block. This organization of arrays into blocks al¬ 
lows reduction of the cost of accessing data in main memory 
by performing transposed accesses. 

The simplest use of transposed memory access occurs in 
Algorithm]^ For every document, this algorithm fetches a 9 
value for every topic. The topics are regarded as divided into 
a leading remnant (if any) and then a sequence of blocks 
of length W. The while loop on lines [3|6] handles the rem¬ 
nant, and then the following while loop processes successive 
blocks. On line [^within the inner loop, note that the ref¬ 
erence is to 9[q X W + k,j + r] rather than the expected 
9[q X W + r,j + k] (which would be the same as 9[m, j + k] 
because m = q x W + r). The result is that when the W 
threads of a SIMD warp execute this code and all access 9 si¬ 
multaneously, they access W consecutive memory locations, 
which can typically be fetched by a hardware memory con¬ 
troller much more efficiently than W memory locations sep¬ 
arated by stride K. Another way to think about it is that on 
any given single iteration of the loop on lines IHUTTI (which 
overall is designed to fetch one W x W block of 9 values) 
instead of every thread in the warp fetching its fcth value 
from the 9 array, all the threads work together to fetch all W 
values that are needed by thread k of the warp. Each thread 
then stores what it has fetched into its local copy of the ar- 











Algorithm 6 Compute partial sums (transposed access) 


Algorithm 7 Drawing new z values using a butterfly table 


1 

code chunk (compute partial sums of 9-(p products) : 

1 

procedure DrawZ{N[M], 9[M, K],4>[V, K], 

2 

let c •(— to [m] [i] 

2 

w[M][N]; output z[M,TV]) 

3 

for all 0 < fc < IT do 

3 

for all 0 < g < M/W do 

4 

Cwarp [^, c] ^ c > Transposed access to Cwarp 

4 

for SIMD 0 < r < IT do 

5 

end for 

5 

let m ^ <7 X IT -1- r 

6 

let sum ^ 0.0 

6 

local array p[K\,9ioca\[K] 

7 

let j ^ 0 

7 

register array a[Ik], Cwarp[bk] 

8 

while j < {K mod W) do 

8 

(cache 9 values into 0iocai) 

9 

sum <r- sum + (6»localb'] X (p[c,j]) 

9 

let ^master 0 

10 

p[m, j] <— sum 

10 

while any{imasteT < N[m]) do 

11 

3 ^ .7 + 1 

11 

let i ^ mm(7niaster, N[m\ - 1) 

12 

end while 

12 

(SIMD compute butterfly partial sums) 

13 

while j < K do 

13 

let j ^ 0 

14 

for k from 0 through IT — 1 do 

14 

(SIMD search butterfly partial sums) 

15 

> Next line uses transposed access to (p 

15 

z[m,i] ^ j 

16 

a[m,k]^ ^'locaib + k]x ^[cwarpb, k],j + r] 

16 

^master ^ ^master H" 1 

17 

end for 

17 

end while 

18 

for k from 0 through IT — 1 do 

18 

end for 

19 

> Next line uses transposed access to a 

19 

end for 

20 

sum sum + a[q x IT -|- fc, r] 
p[m,j + k] -It- sum 

end for 

20 

end 

21 

22 

array cp into blocks (and possibly a remnant) and perform 

23 

j ^ j -1- IT 

transposed accesses to ip. To do this, each thread needs to 

24 

end while 

know what part of the array (p every other thread is interested 

25 

end 

in; this is done through the IT x IT local work array Cwarp ■ In 


ray 0iocai- The result is that each thread doesn’t really have 
all the 9 data it needs to process its document; pictorially, 
data in each W x W block has been transposed. The real 
story, however, is that memory for local arrays in a GPU is 
not stored in the same way as memory for global arrays. The 
global array 6 is laid out in row-major order in main memory 
like this: 

where each row corresponds to a 
document and each column to a 
topic; but each of the local arrays 
is laid out like this: 


line|^ each thread figures out which word is the ith word of 
its document and calls it c; in linesj^Sit then stores its value 
for c into every element of row r of the array Cwarp- This is 
not an especially fast operation, but it pays for itself later on. 
The loop in lines IMT^ computes 9-(j) products and partial 
sums p in the usual way (remember that the remnant in 9local 
is not transposed), but the loop in lines flrUlT] processes a 
block to compute product values to store into the a array; the 


access to ip on line 16 is transposed (note that the accesses 


■ ■■ jr; where each row corresponds to an array index 
III .. b and each column corresponds to a thread. In 
!!! !!e each diagram, the locations in a row are con- 
■■■ tiguous in memory; it is really the fact that we 
... .. L choose to index 0iocai by topic and to assign 
!!! !!e each document to a thread that causes “trans- 
position” to occur. In any case. Algorithm 

-' is coded so that every set of W simultaneous 

(SIMD) memory accesses refers to W consecutive memory 
locations, so it runs much faster than a “nontransposed” 
version would; and the result is that each thread ends up with 
data that other threads need. 

One possible remedy is to have the threads exchange data 
so that each thread has exactly the 9 values it needs for 
the rest of the computation. Instead, we compensate for the 
transposition of 9 in Algorithm The idea is to divide the 


to 9iocai and c^jarp are not transposed; because they were 
constructed and stored in transposed form, normal fetches 
cause their values to line up correctly with the (p values 
obtained by a transposed fetch). So this is pretty good; but in 
line 1^ we finally pay the piper: in order to have the finally 
computed partial sums p reside in the correct thread for the 
binary search, it is necessary to perform a transposed access 
to a on line|^ but a is a local array, so transposed accesses 
are bad rather than good, and this occurs in an inner loop, so 
performance still suffers. 

4. Butterfly-patterned Partial Sums 

We can avoid the cost of the final transposition of a by not 
requiring the partial sums table p for each thread to be en¬ 
tirely in the local memory of that thread. Instead, we arrange 
for the threads to “help each other” during the binary search, 
in much the same way that each thread computes 9-(p prod¬ 
ucts that are actually of interest to other threads. Moreover, 
we avoid computing the entire set of partial sums; instead 












































Algorithm 8 Compute a butterfly-patterned table of sums 

1 

code chunk (SIMD compute butterfly partial sums): 

2 

let c •(— zu [m] [z] 


3 

for all 0 < fc < VF do 


4 

Cwarp]^] ^ shujfle{c, k) 


5 

end for 


6 

let sum ^ 0.0 


7 

let 4^0 


8 

while j < {K mod W) do 


9 

sum ^ sum + (6»iocaib'] x (j)[c,j]) 

10 

p[j] ■<— sum 


11 

3 ^ .7 + 1 


12 

end while 


13 

while j < K do 


14 

for k from 0 through kF — 1 do 


15 

> Next line uses transposed access to (j) 

16 

^local[j k] X ^[c-vyarp 

[k],j + r] 

17 

end for 


18 

for b from 0 through (log 2 W) — 

1 do 

19 

let bit ^ 2^ 


20 

for z from 0 through 2 ^bit ~ ^ 

do 

21 

let d ^ 2 X bit X i + {bit — 

1 ) 

22 

let h ^ (if (to & bit) ^ 0 


23 

then a[d] 


24 

else a[d + bit]) 


25 

let V ^ shujfleXor{h, bit) 


26 

if (r & bit) ^ 0 then 


27 

a[d] ^ a[d -1- bit] 


28 

end if 


29 

a[d + bit] •(— a[d] + v 


30 

pIj + d] ^ a[d] 


31 

end for 


32 

end for 


33 

sum ^ sum -f a[kF — 1] 


34 

p[W — 1] ^ sum 


35 

j ^ j + W 


36 

end while 


37 

end 



(this is our novel contribution) we compute a “butterfly- 
patterned” table of partial sums that is sufficient to recon¬ 
struct any needed partial sum on the fly during the binary 
search process. This makes each step of the binary search 
process slower, but a binary search of a block of W entries 
examines only log 2 W elements of the block. 

Our final version is Algorithm]?] It is quite similar to Al¬ 
gorithm]^ but declares all local arrays in such a way as to 
be thread-local (and specifies that arrays a and c^^arp are ex¬ 
pected to reside in registers). It uses the code chunk in Al- 
gorithm]^to cache 6* values in Oiocai, and also uses two new 
code chunks; Algorithm]^creates a butterfly-patterned table 
of partial sums, and Algorithm]^ uses this table to perform 
the binary search. For this algorithm to work properly, W 
must be a power of 2. 
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Figure 1. On the left, a table of partial sums for 8 threads 
and 19 topics; on the right, a butterfly-patterned table of 
partial sums, in which each thread holds data of interest to 
other threads 


To save space in our figures, we introduce a special abbre¬ 
viation: ■ ?] = ^ (l>[w[rn][i],k]. 

(The variable z is a free parameter of this notation; when we 
use the notation, it will consistently refer to a computation in 
one loop iteration during which i is unchanging.) The large 
number m identifies a thread that owns a document, and the 
subscript p and superscript q identify a subarray a[p: q\-, the 
symbol denotes the sum over that subarray. 

For purposes of illustration, we shall (as before) assume 
W = % and K — 19, so that the remnant is of size 3 and 
there are two blocks of size 8; we also assume M = 8. 

For each i, Algorithm ]^ uses Algorithm ]^ to compute, 
for each m and k, p[m,k] = m§. For our example, this 
produces the p table shown on the left-hand side of Figure]^ 
each column corresponds to a warp thread and contains all 
partial sums needed by that thread. But Algorithm ]^ uses 
Algorithm ]^ to more cheaply compute “butterfly-patterned 
partial sums” as shown on the right-hand side of Figure ]T] 
Not all data needed by a thread is in the column for that 
thread; for example, some data needed by thread 2 appears in 
columns 0, 4, and 6. Note that all the data in the remnant and 
in the last line of each block is in the column for the thread to 
which it belongs—in fact, these entries are identical to those 
computed by Algorithm]^ The tables differ only in the first 
W — 1 rows of each block. 

To see why this table can be called “butterfly-patterned,” 
consider the processing of a single block. We regard a block 




















of p as first being initialized from a block of O-cj) products 
in the W x W local array a —which, recall, is stored in 
transposed form. (To simplify this part of the illustration, 
we will assume that the rows of the block are numbered 
(indexed) from 0 to kk — 1; it is as if the remnant has size 
zero and we are examining the hrst block.) 

The “butterfly” portion of the algorithm sweeps over the 
array p in a specihc order, and at each step operates on 
four entries within p that are at the intersection of two rows 
whose indices differ by a power of 2 and two columns whose 
indices differ by that same power of 2. Suppose the four 


values in those entries are 


a 

b' 

c 

d _ 


; they are replaced by 


a 

d 

a + b 

c + d 


. (It might seem more natural mathemati¬ 


cally to replace the four values with 


a 

c 

a-\-h 

c + d 


, and 


that strategy also leads to a working algorithm, but on the 
NVIDIA Titan GPU, at least, that computation turns out to 
be noticeably more expensive, for reasons related to the pre¬ 
cise instruction sequences required—this butterfly-patterned 
computation occurs in the innermost loop, where the inclu¬ 
sion of even one extra instruction can significantly decrease 
performance.) 

Now, it must be admitted that if we examine the pattern in 
which data is transferred between rows and columns during 
one of these replacement computations, we see that it is not 
the conventional “tX” butterfly design: 

a b ^ d 



To see more precisely why we use the phrase “butterfly- 
patterned,” it is helpful to consider a three-dimensional di¬ 
agram in which the vertical axis is time, the horizontal axis 
spans columns, and the front-to-back axis spans rows: 



The top plane of this diagram is labeled with the state of 
four entries before the replacement computation, the bottom 
plane is labeled with the state of four entries after the re¬ 
placement computation, and the six arrows show the full pat¬ 
tern of transfer between the top plane and the bottom plane 
(including data that remains in place, not moving in space 
but being carried forward in time). The previous diagram is 


a vertical projection of this diagram (that is, what the 3-D 
diagram looks like when seen from above). 

The next diagram is a front-to-back projection of this 
diagram (that is, what the 3-D diagram looks like when seen 
from the front): 


a,c b,d 



and here we clearly see the standard “M” butterfly pattern 
as data is carried forward in time within each column and 
also exchanged between the two columns. Similarly, the next 
diagram is a horizontal projection of the 3-D diagram (that 
is, what the 3-D diagram looks like when seen from the side): 


a,b c,d 



and once again we clearly see the standard “N” butterfly 
pattern as data is carried forward in time within each row 
and also exchanged between the two rows. 

We will use the symbol j; fc, 1]” to indicate appli¬ 
cation of this replacement computation to rows i and j and 
columns k and I —that is, to the four entries at positions 
[i,k], [i,l], [j,k], and [j,l]. In our example with W = 8, 
these replacement computations are performed: 


7^[0,l;0,l] 

7^[0,1;2,3] 

7^[0,1;4,5] 

mi; 6,7] 

7^[2,3;0,1] 

7^[2,3;2,3] 

7^[2,3;4,5] 

m,3;6,7] 

7^[4,5;0,1] 

7^[4,5;2,3] 

7^[4,5;4,5] 

m,5;6,7] 

7^[6,7;0,l] 

7^[6,7;2,3] 

7^[6,7;4,5] 

m:7;6,7] 

7^[1,3;0,2] 

77[1,3;1,3] 

7^[1,3;4,6] 

m:3;5,7] 

7^[5,7;0,2] 

7^[5,7;l,3] 

7^[5,7;4,6] 

77[5,7;5,7] 

7^[3,7;0,4] 

77[3,7;1,5] 

7^[3,7;2,6] 

77(3:7; 3,7] 


where replacements within a set between horizontal lines are 
independent and therefore may be done sequentially or in 
parallel, but each set must be completed before beginning 
computation of the replacements in the group below it. Fig- 
urej^shows the progress of this process with snapshots after 
each set of replacements is performed. In the general case, 
there are log 2 W such sets of replacements. 

After all replacements have been done on a block of size 
W X W, the entry in row i and column j contains the value 
Uy where TO = i0(i-|-l), k = [^J, u = {iSz^m) + {jSzm), 
V = j Sz {^k), and w = v + k. We use the symbol “0’ 
to indicate the bitwise negation (that is, NOT) of an integer 






























Algorithm 9 Searching within a butterfly-patterned table 
1: code chunk (SIMD search butterfly partial sums): 

2: let u ■>r- random value chosen from [0.0,1.0) 

3: let stop •(— sum x u 

4: let searchBase ^ {K mod W) + {W — 1) 

5: let j ^ 0 
6 : let 

7: > Binary search to find correct block of size W 

8 : while j < fc do 

9: let mid ^ 

10: if stop < p 

11: k mid 

12 : else 

13: j mid + 1 

14: end if 

15: end while 

16: let hlockBase ^ {K mod W) + j x W 

17: ii K >W then 

18: (butterfly search one block) 

19: end if 

20: if blockBase > 0 then 
21: if stop < p[m, blockBase — 1] then 

22: > Not in a block after all, so search remnant 

23: for i from 0 through {K mod W) — 1 do 

24: if stop < p[i] then 

25: j ^ i 

26: break 

27: end if 

28 : end for 

29: end if 

30: end if 

31: end 


j + k 
2 

mid X W + searchBase] then 


Algorithm 10 Butterfly search within one W xW block 
1 : code chunk (butterfly search one block): 

2 : let lowValue ^ (if blockBase > 0 
3: then p[blockBase — 1] 

4: else 0) 

5: let highValue p[blockBase + {W — 1)] 

6: let flip ^ 0 

7: \> Butterfly search within the block of size W 

8 : for b from 0 through (log 2 W) — 1 do 

9: let ^ 

10 : let mask <— {{W — 1) x (2 x bit)) & {W — 1) 

11 : let 2/^0 

12 : for i from 0 through 2 '^bit ~ 1 do 

13: let d {bit — 1) + 2 X bit x i 

14: let him ■<— (d & mask) -|- (r & —^mask) 

15: let hisBlockBase ■<— shujfle{blockBase, him) 

16: let t ^ shuffleXor{p[hisBlockBase -I- (l\,flip) 

17: if ((r 0 d) & mask) = 0 then 

18: y^t 

19: end if 

20 : end for 

21 : let compareValue ^ (if (r & bit) 7 ^ 0 

22: then highValue — y 

23: else lowValue + y) 

24: if stop < compare Value then 

25: highValue •<— compareValue 

26: flip ^ flip (B {bit Sz r) 

27: else 

28: lowValue •<— compareValue 

29: flip flip 0 {bit Sz —'r) 

30: end if 

31: end for 

32: j ^ blockBase + {flip 0 r) 

33: end 


represented in binary form; we also use the symbol to 
indicate the the bitwise conjunction (that is, AND) of two 
binary integers, and the symbol “0” to indicate the bitwise 
exclusive OR (that is, XOR) of two binary integers. 

Algorithm performs this computation. The function 
shuffle is used in line|^to broadcast values from each thread 
of a warp to all the others, and the function shuffleXor 
is used in line to exchange values between pairs of 
threads whose thread numbers differ by a power of 2. These 

are precisely the CUDA intrinsic functions_ shfl and 

_shfl_xor miioi. 

Within a butterfly-patterned block of partial sums, Algo- 
rithm|^performs a binary search as follows. The stop value 
is computed exactly as in Algorithm and a block to be 
searched is identified by performing a binary search on the 
subarray consisting of just the last row of each block; this 
identifies a specific block to search. If iT > W, then there 
is at least one block, and it is searched, but it is possible that 
the desired stop value does not lie within that block; in that 
case, the remnant is searched using a linear search. 


In order to search within a block, Algorithm[T0|maintains 
two additional state variables lowValue and highValue. An 
invariant is that if thread m has indices j through kofa block 
still under consideration, then lowValue = 
and highValue = ^ order to cut the search 

range in half, the binary search needs to compare the stop 
value to the midpoin^alue TOq^ 


blockBase+rmd ^ 


j + k 
2 


J; in Algorithm 


this value is of course an entry in the 
p array, namely p[m, mockBase + mid], but in Algorithm[To| 
the midpoint value is calculated by choosing an appropri¬ 
ate entry from the butterfly-patterned p array and then ei¬ 
ther adding it to lowValue or subtracting it from highValue. 
Whether to add or subtract on iteration number b (where the 
log 2 W iterations are numbered starting from 0) depends on 
whether bit b (counting from the right starting at 0) of the 
binary representation of m is 0 or 1, respectively. Depend¬ 
ing on the result of the comparison of the midpoint value 
with the stop value, the midpoint value is assigned to either 
lowValue and highValue, maintaining the invariant. Also 





















Figure 2. Generation of butterfly-patterned partial sums for a VF x W block in three steps, each using a set of four-element 
replacement computations 



Number of topics K 


Figure 3. Execution time (K = 32k -|- 16,0 < A; < 7) 


depending on the result of the comparison of the midpoint 
value with the stop value, a bit of a third state variable flip 
(initially 0) is updated. When the binary search is complete, 
the correct index to select is computed from the value in flip. 

A normal binary search effectively walks down a binary 
decision tree, where each node of the tree is labeled with a 
(normal) partial sum; but Algorithm [T0| walks down a tree 
that is labeled with entries from the butterfly-patterned table 
of partial sums. For example, refenlng to the first seven rows 
of the upper block in the right-hand diagram in Figure[T] the 
entries relevant to thread 5 form this tree: 

Starting with low Value = 5g and 
highValue = 5q°, lines 1814311 of 
Algorithm can walk down the 
tree, at each node calculating a 
new midpoint by adding the node 
label to low Value or subtracting it 
from highValue as appropriate. 

The threads in a warp assist one another in fetching 
these tree nodes using the loop in lines [1211^ the function 
shujfleXor effects this data transfer in linefTb] 



5. Evaluation 

We coded four versions of a complete LDA Gibbs-sampler 
topic-modeling algorithm in CUDA for an NVIDIA Titan 
Black GPU (W = 32). For each version we tested two vari¬ 
ants, one using Algorithmic (using the binary search of Al¬ 
gorithm [^1 and one using Algorithm |C All eight variants 
were tested for speed using a Wikipedia-based dataset with 
number of documents M = 43556, vocabulary size V = 
37286, total number of words in corpus = 3072662 
(therefore average document size {JiN)/M « 70.5), and 
maximum document size maxW = 307. Each variant was 
measured using eight different values for the number of top¬ 
ics K (16, 48, 80, 112, 144, 176, 208, and 240), in each 
case performing 100 sampling iterations and measuring the 
execution time of the entire application, not just the part that 
draws z values. Best performance requires unrolling three 
loops in Algorithmic we had to manually unroll the loop 
that starts on linefTCand the CUDA compiler then automat¬ 
ically unrolled the loops that start on line s [T4| and The 
performance results are shown in Figure [CThe butterfly 
variants are faster for K > 80. For K > 200, for each of the 
four versions the butterfly variant is more than twice as fast. 

6. Related Work 

Because the computed probabilities are relative in our FDA 
application, it is necessary to compute all of them and then 
to compute, if nothing else, their sum, so that the relative 
probabilities can be effectively normalized. Therefore every 
method for drawing from a discrete distribution represented 
by a set of relative probabilities involves some amount of 
preprocessing before drawing from the distribution. The var¬ 
ious algorithms in the literature have differing tradeoffs ac¬ 
cording to what technique is used for preprocessing what 
technique is used for drawing; some algorithms also accom¬ 
modate incremental updating of the relative probabilities by 
providing a technique for incremental preprocessing. 

Instead of doing a binary search on the table of par¬ 
tial sums, one can instead (as Marsaglia na observes 
in passing) construct a search tree using the principles 
of Huffman encoding 0 (independently rediscovered by 

























Zimmerman 123) to minimize the expected number of com¬ 
parisons. In either case the complexity of the search is 
0(log n), but the optimized search may have a smaller con¬ 
stant, obtained at the expense of a preprocessing step that 
must sort the relative probabilities and therefore have com¬ 
plexity n(nlogn). 

Walker lUSl [H] describes what is now known as the 
“alias” method, in which n relative probabilities are prepro¬ 
cessed into two additional tables F and A of length n. To 
draw a value from the distribution, let fc be a integer cho¬ 
sen uniformly at random from — 1} and let u 

be chosen uniformly at random from the real interval [0,1). 
Then the value drawn from the distribution is 

if M < Ffc then k else 

Therefore, once the tables F and A have been produced, 
the complexity of drawing a value from the distribution is 
0(1), assuming that the cost of an array access is 0(1). 
Walker’s method im for producing the tables F and A 
requires time 0(n^); it is easy to reduce this to 17 (n log n) by 
sorting the probabilities Q exercise 3.4.1-7] and then using, 
say, priority heaps instead of a list for the intermediate data 
structure. Either version heuristically attempts to minimize 
the probability of having to access the table A. 

Vose lEl describes a preprocessing algorithm, with 
proof, that further reduces the preprocessing complexity of 
the alias method to 0(n). The tradeoff that permits this im¬ 
provement is that the preprocessing algorithm makes no at¬ 
tempt to minimize the probability of accessing the array A. 

Matias et al. ifTSll describe a technique for preprocessing 
a set of relative probabilities into a set of trees, after which 
a sequence of intermixed generate (draw) and update opera¬ 
tions can be performed, where an update operation changes 
just one of the relative probabilities; a single generate op¬ 
eration takes 0(log* n) expected time, and a single update 
operation takes 0(log* n) amortized expected time. 

Li et al. Qo) describe a modified LDA topic modeling al¬ 
gorithm, which they call Metropolis-Hastings-Walker sam¬ 
pling, that uses Walker’s alias method but amortizes the 
cost of constructing the table by drawing from the same 
table during multiple consecutive sampling iterations of a 
Metropolis-Hastings sampler; their paper provides some jus¬ 
tification for why it is acceptable to use a “slightly stale” 
alias table (their words) for the purposes of this application. 

7. Conclusions 

The technique of constructing butterfly-patterned partial 
sums appears to be best suited for situations where a SIMD 
processor is used to compute tables of relative probabil¬ 
ities for multiple discrete distributions, each of which is 
then used just once to draw a single value, and where each 
thread, when computing its table, must fetch data from a 
contiguous region of memory whose address is computed 
from other data. The LDA application for which we devel¬ 


oped the technique has these characteristics. The technique 
uses transposed memory access in order to allow a SIMD 
memory controller to touch only one or two cache lines on 
each fetch, then cheaply constructs a set of partial sums that 
are just adequate to allow partial sums actually needed to be 
constructed on the fly during the course of a binary search. 
For a complete LDA Gibbs-sampler topic-modeling algo¬ 
rithm coded in CUDA for an NVIDIA Titan Black GPU and 
already tuned as best we could for high performance, the 
butterfly-patterned partial-sums technique further improves 
the speed of the overall application by at least a factor of 2 
when the number of topics is greater than 200. 
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